I chose to use the Prosper data set. Although it is large, the data is most interesting to me. I have a background in finance, and I love personal finance. I enjoy helping people understand their finances, set budgets and improve their credit scores, since these steps have helped me out in life so so much. I personally am an investor at Prosper, so I thought it would be interesting to try to predict the score Prosper gives a potential loan (‘ProsperScore’), or try to predict a loan’s APR.
## BorrowerAPR ProsperScore EmploymentStatus
## Min. :0.04583 Min. : 1.000 Employed :66586
## 1st Qu.:0.16361 1st Qu.: 4.000 Full-time : 7926
## Median :0.21945 Median : 6.000 Self-employed: 4456
## Mean :0.22694 Mean : 5.953 Other : 3742
## 3rd Qu.:0.29254 3rd Qu.: 8.000 Not employed : 649
## Max. :0.42395 Max. :11.000 Retired : 367
## (Other) : 256
## IsBorrowerHomeowner OpenRevolvingMonthlyPayment RevolvingCreditBalance
## False:39560 Min. : 0.0 Min. : 0
## True :44422 1st Qu.: 155.0 1st Qu.: 3804
## Median : 310.0 Median : 9310
## Mean : 430.4 Mean : 17936
## 3rd Qu.: 563.0 3rd Qu.: 20335
## Max. :13765.0 Max. :999165
##
## BankcardUtilization DebtToIncomeRatio StatedMonthlyIncome
## Min. :0.000 Min. : 0.000 Min. : 0
## 1st Qu.:0.330 1st Qu.: 0.150 1st Qu.: 3427
## Median :0.600 Median : 0.220 Median : 5000
## Mean :0.564 Mean : 0.259 Mean : 5931
## 3rd Qu.:0.830 3rd Qu.: 0.320 3rd Qu.: 7083
## Max. :2.500 Max. :10.010 Max. :1750003
## NA's :7214
## LoanOriginalAmount CreditScore
## Min. : 1000 Min. :609.5
## 1st Qu.: 4000 1st Qu.:669.5
## Median : 7500 Median :709.5
## Mean : 9061 Mean :709.0
## 3rd Qu.:13500 3rd Qu.:729.5
## Max. :35000 Max. :889.5
##
My data set consists of eleven variables, two of which I will be trying to predict (ProsperScore or BorrowerAPR). The other nine variables are facts about a customer’s credit and lifestyle. One variable, EmploymentStatus, is factored with nine levels. Another variable, IsBorrowerHomeowner is also factored, but with only a True or False value. The remaining seven variables are all continuous variables. Five of these variables (OpenRevolvingMonthlyPayment, RevolvingCreditBalance, BankcardUtilization, DebtToIncomeRatio and StatedMonthlyIncome) have very large outliers, so I’m going to keep that in mind as I move on.
This histogram of ProsperScore shows that it is fairly normal distributed, with discrete values.
The histogram for BorrowerAPR is also fairly normally distributed, with a large spike around 0.36/0.37, so that is a bit curious. There is also a spike at 0.29.
This histogram of CreditScore is normal, with a right skewness, indicating that there may be outliers on the larger side of CreditScore that are pulling the mean. Most CreditScores are between 650 and 750.
Most borrowers in the data set are Employed.
RevolvingCreditBalance was one of the variables with larger outliers that I noticed from the Summary table (looking at the 3rd Quartile value vs. the Max, they are very spread apart). Limiting the x axis to the 95th quantile cuts off many of those large outliers, bringing the max down from 999,165 to about 60,000. Much cleaner and easier to read.
Also, taking the log10 of RevolvingCreditBalance made the data much more normally distributed (slightly left-skewed), so I’m going to keep that in mind for future analysis as well.
Observing this GGPairs plots, and keeping in mind I’d like to predict ProsperScore or BorrowerAPR, what do I want to explore more? Well, looking at the correlation coefficients, the variable most correlated with ProsperScore is BorrowerAPR. This makes sense because both of these variables are the outcome of the analysis that Prosper does on an individual. Next most correlated with ProsperScore is CreditScore.
The strongest correlation of all my variables pairs RevolvingCreditBalance and OpenRevolvingMonthlyPayment. This is understandable because the higher the credit balance is, the higher the monthly payment should be (perhaps a lower monthly payment for someone with a higher CreditScore? <- something to explore).
The first relationship I looked at was ProsperScore vs BorrowerAPR. They were indeed correlated, although not as strongly as I would have assumed. There is quite a lot of data at the 0.36 APR line (36%), which pulls the line of best fit higher than I would expect.
Second, I looked at ProsperScore vs CreditScore. Again, these two variables were correlated, and when I get to making a model to predict ProsperScore, I’m sure CreditScore will be included. The variance of the data was larger than I was expecting, as I can many borrowers with credit scores in the range of 640 to 800 all received ProsperScores of 8, for example.
Next I looked at ProsperScore vs log(RevolvingCreditBalance), and didn’t find much correlation. Unfortunately this particular analysis didn’t add any information to my investigation.
The two variables with the highest correlation on my GGPairs plots was RevolvingCreditBalance vs OpenRevolvingMonthlyPayment. Both of these variables have large outliers, so I limited the plot to only the 95th percentile for each variable. The plot shows a strong positive correlation, which makes sense when I think about what each variable represents. The higher a borrower’s credit balance is, the higher (I would assume) their monthly payments on that balance would be.
Lastly, expects say that 30% of a person’s credit score is determined by the amount of outstanding debt that they currently have, so I was curious to see if BankcardUtilization has an impact on a person’s ProsperScore. I limited y to 1 because I’m not sure that it’s possible to have over 100% utilization, and if so, those would be outliers anyways. Unfortunately, I didn’t see a strong correlation, but there was a faint negative trend to the data. As a borrower’s utilization went up, their ProsperScore went down, in general.
The first thing I looked at was ProsperScore vs CreditScore, faceted by whether or not the borrower was a homeowner. I noticed that the line of best fit was slightly more steep for homeowners, meaning that the correlation between ProsperScore and CreditScore was slightly stronger for them. Also, I noticed that, in general, homeowners had a wider range of ProsperScores, filling up more of the higher ProsperScores than non-homeowners.
Next, this plot is a funfetti plot, if I’ve ever seen one. Again, I looked at ProsperScore vs CreditScore, this time coloring the points by a borrower’s income range (which I bucketed, to make the plot easier to interpret). There is a clear correlation with income range; I can detect a slight correlation of higher incomes with higher Prosper and Credit Scores, and lower incomes with lower Prosper and Credit Scores (the plot is more green/blue in the top right, and slightly more red/orange in the bottom left). But, this plot is very overplotted, even using 60% transparency, so I’d like to take a look at just a sample of the data.
To get a more clear picture, I took a random sample of 5,000 data points and created a new data frame from those. Then I replotted the same graph with my random sample, to see if I could gather any trends from this smaller data set. It seems about the same, with a lower income slightly correlated to a lower Credit and Prosper Score, and vice verse.
Lastly, since my strongest correlation was between OpenRevolvingMonthlyPayment and RevolvingCreditBalance, I wanted to see if CreditScore was a factor in the correlation. I plotted them, with CreditScore going from blue to red, and as you can see, the plot is VERY overplotted and very hard to interpret because of this.
So, I used the sample data frame I made earlier and plotted that instead. This plot is much easier to read, and I can see that the correlation between OpenRevolvingMonthlyPayment and RevolvingCreditBalance is much stronger with lower values, and becomes much more dispersed with higher values of each variables. Unfortunately, there is no discernible pattern to CreditScore on this data.
##
## Calls:
## m1: lm(formula = ProsperScore ~ CreditScore, data = dfsubset)
## m2: lm(formula = ProsperScore ~ CreditScore + log10(DebtToIncomeRatio +
## 0.01), data = dfsubset)
## m3: lm(formula = ProsperScore ~ CreditScore + log10(DebtToIncomeRatio +
## 0.01) + IsBorrowerHomeowner, data = dfsubset)
## m4: lm(formula = ProsperScore ~ CreditScore + log10(DebtToIncomeRatio +
## 0.01) + IsBorrowerHomeowner + EmploymentStatus, data = dfsubset)
## m5: lm(formula = ProsperScore ~ CreditScore + log10(DebtToIncomeRatio +
## 0.01) + IsBorrowerHomeowner + EmploymentStatus + log10(BankcardUtilization +
## 0.01), data = dfsubset)
##
## ============================================================================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------------------------------------------------------------
## (Intercept) -7.246*** -8.816*** -9.095*** -9.165*** -10.006***
## (0.115) (0.114) (0.116) (0.115) (0.122)
## CreditScore 0.019*** 0.019*** 0.019*** 0.019*** 0.021***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## log10(DebtToIncomeRatio + 0.01) -2.566*** -2.552*** -2.476*** -2.627***
## (0.031) (0.031) (0.030) (0.031)
## IsBorrowerHomeowner: True/False -0.183*** -0.200*** -0.263***
## (0.016) (0.016) (0.016)
## EmploymentStatus: Full-time/Employed 0.996*** 1.025***
## (0.025) (0.025)
## EmploymentStatus: Not employed/Employed 3.719 3.742
## (2.053) (2.048)
## EmploymentStatus: Other/Employed -0.500*** -0.479***
## (0.036) (0.036)
## EmploymentStatus: Part-time/Employed 1.316*** 1.404***
## (0.146) (0.146)
## EmploymentStatus: Retired/Employed 0.494*** 0.563***
## (0.115) (0.115)
## EmploymentStatus: Self-employed/Employed -1.839*** -1.841***
## (0.317) (0.316)
## log10(BankcardUtilization + 0.01) 0.354***
## (0.017)
## ----------------------------------------------------------------------------------------------------------------------------
## R-squared 0.137 0.222 0.223 0.243 0.247
## adj. R-squared 0.137 0.222 0.223 0.243 0.247
## sigma 2.205 2.082 2.080 2.053 2.048
## F 13324.565 10927.496 7342.991 2733.551 2514.873
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -185586.103 -165213.415 -165145.647 -164158.035 -163951.410
## Deviance 408458.057 332665.313 332078.507 323643.182 321905.658
## AIC 371178.206 330434.829 330301.294 328338.070 327926.820
## BIC 371206.221 330471.823 330347.537 328439.804 328037.802
## N 83982 76768 76768 76768 76768
## ============================================================================================================================
The last step I took was to build a model with the five mostly likely predictors of ProsperScore. Even with these five variables, the R squared value is only 0.247, so I would not feel confident saying this is a strong model to predict a borrower’s ProsperScore.
I chose this plot as my first graphic because I found several variables that had large outliers (including RevolvingCreditBalance) and those outliers pulled the data and made them very right skewed. By taking the log 10 of these variables, it made their distributions much easier to work with.
I chose this plot because I used this relationship as a basis for predicting ProsperScore. Though the relationship is not perfectly linear, CreditScore was the variable with the highest correlation to ProsperScore.
I chose this plot as my third most interesting plot. I changed the colors from a funfetti, to a sliding scale from lavender to red. Though there isn’t a STRONG correlation between ProsperScore and CreditScore, with this plot I can definitely tell there is a correlation between those two variables and Income. The top right of the plot is much more red, and the bottom left of the plot is much more lavender.
In this exploration, I didn’t find the strong relationships that I was hoping to find, but was pleasantly surprised to find some correlations. I struggled in the beginning, narrowing down this large data set; deciding what to keep and what to toss was difficult. Perhaps in the future, a better model could be made to predict ProsperScore with more of the variables that I discarded. Of course, Prosper has their algorithm, which uses some, if not all, of these variables.